Exploring Analytics-Related Occupations: A Data Mining Perspective on US Labor Market Insights for Analytics Students and Graduates¶

Abstract:
This study presents an in-depth analysis of analytics-related occupations within the US labor market, focusing on providing valuable insights to students. Utilizing data mining techniques such as clustering and classification, we examine the trends and statistics of these occupations and offer descriptive analytics, including exploratory data analysis (EDA) and clustering. Furthermore, predictive analytics is employed to forecast job postings and other relevant information, enabling students to make informed decisions about their career paths. By analyzing the analytics-related job landscape, this research contributes to informing students about the diverse opportunities available and guiding their professional journey.

Keywords: analytics-related occupations, US labor market, data mining techniques, clustering, classification, descriptive analytics, predictive analytics, exploratory data analysis (EDA), job postings, student insights.

1. Introduction¶

1.1 Introduction into US Labor Market Landscape¶

Exploring analytics-occupations on US Labor Market requires the knowledge on how the labor market are organized in the United States along with its supporting infrastructure and ecosystem. There are three components that support the ecosystem of the labor market in the United States which are:

  1. Industry Classification: Various industries in the United States are classified into several categories based on their characteristics and nature, and the classifi-cation for these industries adheres to the North American Industry Classification System. The North American Industry Classification System (NAICS) is a stand-ardized system used by Federal statistical agencies in the US to classify business establishments, gather statistical data, and analyze the US business economy. NAICS divides the economy into 20 sectors. Industries within these sectors are grouped according to the production criterion

  2. Occupation Classification: The Standard Occupational Classification (or SOC) system is a federal statistical standard employed by federal agencies for classifying workers into occupational categories, aiming to collect, compute, or publish data. Workers are assigned to one of 867 distinct occupations based on their de-scription. Detailed occupations are combined to create 459 broad occupations, 98 minor groups, and 23 major groups, making classification more efficient. Occupations with similar job responsibilities, and sometimes comparable skills, education, and/or training, are grouped in the SOC

  3. Skills Classification. There are no official skill classifications from the US gov-ernment but Lightcast, one of the leading labor market aggregators and analytics, has suggested skill classification and definition. The Open Skills Library by Lightcast defines skills as abilities related to particular tasks or knowledge of specific subjects and tools obtained through education or experience. The library categorizes each skill as specialized, common, or certifications.
    Specialized Skills, also known as technical skills or hard skills, are competen-cies that are mostly necessary within a specific occupation or enable an individ-ual to carry out a particular task. Examples of specialized skills include "NumPy" or "Hotel Management".
    Common Skills refer to the skills widely used across various occupations and industries, encompassing both learned skills and personal attributes. These skills may include "Communication" or "Microsoft Excel" and are also known as com-petencies, soft skills, or human skills.
    Certifications refer to qualifications that are recognized by industry or educa-tional bodies, such as a "Cosmetology License" or a "Certified Cytotechnologist" designation. These certifications indicate that the individual has achieved specif-ic knowledge or expertise in a particular field or skill.
    Software Skills refer to the software proficiency that the employer sought after for the particular occupations.
    Please note that one occupation could have several job titles under its umbrella, for example, occupation of "Data Scientists" could have job titles posted by the employer such as "Data Scientists", "Data Analysts", and "Information Analysts". You could explore various job titles for various occupations later below in this report.

1.2 Data Source¶

The data used for this research comes primarily from lightcast.io, the world leading Labor Market aggregator and analytics. The data were queried from Lightcast database for analytics-related occupations dating back to 2014 up to 2023. Data queried include but not limited to number of job posting for analytics-related occupations per year, salary data, skill and qualification data, and education data. The data queried consider the parameter for the recent graduate with 0 to 3 years of professional experience.

Another data source that was being used for this research comes from United States Bureau of Labor Statistics for the reference of the SOC data.

Below are the website to each of the aforementioned data-sources:

  1. Lightcast: https://lightcast.io/
  2. US Bureau of Labor Statistics: https://www.bls.gov/soc/

1.3 Data Cleaning and Manipulation¶

The dataset used in this analysis were queried from the above sources and then went through cleaning and manipulation process to make the data fit for the analysis. The process of cleaning and manipulating were assisted by python script developed to automate the process. The resulting data that fits for analysis comprised of two different grouping:

  1. General Occupation Dataset : comprised of job posting, salary and wage information, skill and qualification information, experience and education breakdown, job posting location, job titles all aggregated from all analytics-related occupations from year 2014 to 2023.
  2. Specific Occupation Dataset: the same kind of information dataset but specifically acquired for each of the analytics-related occupation. This dataset are curated only for the top 10 occupations for each year.

2. Analysis Preparation¶

2.1 Importing Relevant Packages and Setting Jupyter Notebook Output¶

In [1]:
import warnings
import pandas as pd
import numpy as np
import gdown
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
from IPython.display import display, HTML, IFrame, clear_output
import IPython.core.display as di
import ipywidgets as widgets
import seaborn as sns
import scipy.stats as stats
from scipy.stats import shapiro
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.api import SimpleExpSmoothing, ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from itertools import product
import dash
from dash import dcc
from dash import html
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Output Display Setting
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
warnings.filterwarnings("ignore", category=UserWarning)

2.2 Importing Datasets¶

The data for the analysis were previously extracted from source website and are currently hosted in a google drive cloud storage. In order to download all necessary data, below function will be used.

In [2]:
def download_files(file_urls, file_names):
    for url, name in zip(file_urls, file_names):
        print(f"Downloading {name}...")
        gdown.download(url, name, quiet=False)
        print(f"Downloaded {name} successfully.")
In [3]:
%%capture

file_urls = [
    "https://docs.google.com/spreadsheets/d/1sC812NjmZs9ICGBMlLGqw3vG_tMWJc7R/export?format=csv", #advertised_salary
    "https://docs.google.com/spreadsheets/d/1qXIc9W0sssVbLib5alYeKdZwNKp9jkcy/export?format=csv", #advertised_wage
    "https://docs.google.com/spreadsheets/d/1jM4paJv2jb1i3r82Bsv0BYSzV1qdTheq/export?format=csv", #location
    "https://docs.google.com/spreadsheets/d/1AWbFo-T-MUGuN6oeeyvQ6CK-n_pY4eWd/export?format=csv", #exec_summ
    "https://docs.google.com/spreadsheets/d/14WpHz-BttR_ueWSEy5R-n3bjB514HAcz/export?format=csv", #exp_breakdown
    "https://docs.google.com/spreadsheets/d/1m7LpQ76nUlZiFd2Svb9HhlaoE2XUn7Py/export?format=csv", #job_posting_samples
    "https://docs.google.com/spreadsheets/d/12oLAhLfekEk8xsqeSMeqcbau6fQiD26i/export?format=csv", #job_posting_sites
    "https://docs.google.com/spreadsheets/d/1b1FJClS9Fp325L7BnupS4pkdYmGyznWO/export?format=csv", #job_titles
    "https://docs.google.com/spreadsheets/d/1eIo9zSgfl5Sd8DLUBJiZi2f6__L53JUp/export?format=csv", #job_posting_timeseries
    "https://docs.google.com/spreadsheets/d/1MyzRgoQa-Xvl-RCPYM1U3vtsEruNmEu5/export?format=csv", #top_companies
    "https://docs.google.com/spreadsheets/d/1CpI4xUIkbVGX1nP89hsPxkTkILkgPjT5/export?format=csv", #top_industries
    "https://docs.google.com/spreadsheets/d/1jVc7ZnvM1Nqa3n79zp4E-Sc0krOBCaXe/export?format=csv", #occupation_onet
    "https://docs.google.com/spreadsheets/d/1ta9J7Y6MQBJHB5GIgwicqdhcSHEtN_gs/export?format=csv", #occupation_soc
    "https://docs.google.com/spreadsheets/d/1eJKh96aTU5A7RMDlmjiYdo3YhumDCjVI/export?format=csv", #common_skills 
    "https://docs.google.com/spreadsheets/d/1ngCEHAGAOX3WZSmEgWqAeRRJ4bUoYsCP/export?format=csv", #qualification
    "https://docs.google.com/spreadsheets/d/1C6mbUAggVXFN6UZ0er3ZuIt84vWxEJpz/export?format=csv", #software_skills
    "https://docs.google.com/spreadsheets/d/1Yl0ji6GWACtdCcFqD5M5cWGgQ-F0zJHU/export?format=csv", #specialized_skills
    "https://docs.google.com/spreadsheets/d/14VjdBk2WupDVVRAAIsp2MWDlKeLIy62lKkl-08rVmR4/export?format=csv", #aggregated_salary
    "https://docs.google.com/spreadsheets/d/1gVnpT77gayE6_Ako1V8A5cOqJXWubE0r/export?format=csv", #agg_adv_salary
    "https://docs.google.com/spreadsheets/d/1014BO8xGzs4-ungnOS6WO4qKgcZyKQSLkg_WlDpZQtA/export?format=csv", #occupation_salary
    "https://docs.google.com/spreadsheets/d/1931ghwDf3wG9Tip76_RyI_9bGbAhPOhG/export?format=csv", #agg_exec_summ
    "https://docs.google.com/spreadsheets/d/1CrqtKkrGyxt0WEW1PRoz5YgZQ65U-mKk/export?format=csv" #clustering set
    
    
]
file_names = [
    "advertised_salary.csv",
    "advertised_wage.csv",
    "top_cities.csv",
    "exec_summ.csv",
    "exp_breakdown.csv",
    "job_posting_samples.csv",
    "job_posting_sites.csv",
    "job_titles.csv",
    "job_posting_timeseries.csv",
    "top_companies.csv",
    "top_industries.csv",
    "occupation_onet.csv",
    "occupation_soc.csv",
    "common_skills.csv",
    "qualification.csv",
    "software_skills.csv",
    "specialized_skills.csv",
    "aggregated_salary.csv",
    "agg_adv_salary.csv",
    "occupation_salary.csv",
    "agg_exec_summ.csv",
    "clustering.csv"
]

download_files(file_urls, file_names)
In [4]:
advertised_salary = pd.read_csv("advertised_salary.csv")
advertised_wage = pd.read_csv("advertised_wage.csv")
exec_summ = pd.read_csv("exec_summ.csv")
exp_breakdown = pd.read_csv("exp_breakdown.csv")
job_posting_samples = pd.read_csv("job_posting_samples.csv")
job_posting_sites = pd.read_csv("job_posting_sites.csv")
job_titles = pd.read_csv("job_titles.csv")
job_posting_timeseries = pd.read_csv("job_posting_timeseries.csv")
top_cities = pd.read_csv("top_cities.csv")
top_companies = pd.read_csv("top_companies.csv")
top_industries = pd.read_csv("top_industries.csv")
occupation_onet = pd.read_csv("occupation_onet.csv")
occupation_soc = pd.read_csv("occupation_soc.csv")
common_skills = pd.read_csv("common_skills.csv")
qualifications = pd.read_csv("qualification.csv")
software_skills = pd.read_csv("software_skills.csv")
specialized_skills = pd.read_csv("specialized_skills.csv")
aggregated_salary = pd.read_csv("aggregated_salary.csv")
agg_adv_salary = pd.read_csv("agg_adv_salary.csv")
occupation_salary = pd.read_csv("occupation_salary.csv")
agg_exec_summ = pd.read_csv("agg_exec_summ.csv")
clustering=pd.read_csv("clustering.csv")

Setting Plotly Display Configuration

In [5]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

Exploratory Data Analysis (EDA)¶

In this section, we will dive into the historical data of analytics-related occupations for the last 10 years (2014 - 2023) and gain some insights about how the occupations have changed over the years.

1. Top 10 Occupations for Analytics Graduate

In [6]:
enable_plotly_in_cell()

top_10_occupations = occupation_soc.groupby(['Year', 'Occupation (SOC)']).sum()
top_10_occupations = top_10_occupations.groupby('Year').apply(lambda x: x.nlargest(10, 'Unique Postings'))
pivot_table = top_10_occupations.pivot_table(index='Occupation (SOC)', columns='Year', values='Unique Postings')

fig = go.Figure()

for year in pivot_table.columns:
    fig.add_trace(go.Bar(
        name=str(year),
        y=pivot_table.index,
        x=pivot_table[year],
        orientation='h',
        hovertemplate='%{y}: Year ' + str(year) + '<br>Unique Postings: %{x}'
    ))

fig.update_layout(
    barmode='stack',
    title='Top 10 Occupations by Year',
    xaxis=dict(title='Unique Postings'),
    yaxis=dict(title='Occupation'),
    legend=dict(title='Year')
)

fig.show()

Throughout 2014 to 2023, there are 16 occupations that ranked within the top 10 most posted occupations related to the analytics graduate, which are (alphabetically):

  1. Accountants and Auditors
  2. Computer Occupations, All Other
  3. Computer Systems Analysts
  4. Data Scientists
  5. Database Administrators
  6. Database Architects
  7. Financial and Investment Analysts
  8. Human Resources Specialists
  9. Management Analysts
  10. Managers, All Other
  11. Market Research Analysts and Marketing Specialists
  12. Marketing Managers
  13. Operations Research Analysts
  14. Personal Financial Advisors
  15. Sales Managers
  16. Software Developers

Now, lets take a look at how the above 16 occupations job posting evolved from year to year. Please note that you could actually play the animation on the right dashboard by clicking the "play" button if the animation does not start automatically.

In [7]:
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)
url1 = "https://app.powerbi.com/view?r=eyJrIjoiZmNiMmEwMjAtZTgxMS00YjZjLTk5YzAtMTg1NmY3YjYwNTYyIiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
url2 = "https://app.powerbi.com/view?r=eyJrIjoiM2QzMTgxNDEtN2IwZS00M2RjLTlkODYtZjEwYTIxNzBmZTU5IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = f'''
<div style="display: flex; justify-content: center;">
    <div style="display: inline-block;">
        <iframe title="URL 1" src="{url1}" width="600" height="373.5" frameborder="0" allowfullscreen="true" style="margin: 0 auto; max-width: 100%;"></iframe>
    </div>
    <div style="display: inline-block;">
        <iframe title="URL 2" src="{url2}" width="600" height="373.5" frameborder="0" allowfullscreen="true" style="margin: 0 auto; max-width: 100%;"></iframe>
    </div>
</div>
'''
display(HTML(html_code))

Now lets drill a bit deeper into descriptive statistics and timeseries posting for each of the 16 occupations above

In [8]:
enable_plotly_in_cell()

occupation_stats = occupation_soc.groupby('Occupation (SOC)')['Unique Postings'].describe()

occupations_filter = ['Accountants and Auditors', 'Computer Occupations, All Other', 'Computer Systems Analysts',
                      'Data Scientists', 'Database Administrators', 'Database Architects',
                      'Financial and Investment Analysts', 'Human Resources Specialists', 'Management Analysts',
                      'Managers, All Other', 'Market Research Analysts and Marketing Specialists',
                      'Marketing Managers', 'Operations Research Analysts', 'Personal Financial Advisors',
                      'Sales Managers', 'Software Developers']

filtered_stats = occupation_stats.loc[occupations_filter]

def show_stats_and_plot(occupation):
    stats_table = filtered_stats.loc[occupation].reset_index()
    display(stats_table)
    
    occupation_data = occupation_soc[occupation_soc['Occupation (SOC)'] == occupation]
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=occupation_data['Year'], y=occupation_data['Unique Postings'], mode='lines+markers'))
    fig.update_layout(title=f'Unique Postings for {occupation} over Time', xaxis_title='Year', yaxis_title='Unique Postings')
    fig.show()

occupation_dropdown = widgets.Dropdown(options=occupations_filter, description='Occupation:')

output = widgets.interactive_output(show_stats_and_plot, {'occupation': occupation_dropdown})

display(occupation_dropdown, output)
Dropdown(description='Occupation:', options=('Accountants and Auditors', 'Computer Occupations, All Other', 'C…
Output()

From the descriptive statistics above, Data Scientists top the average yearly job posting with about 4478 job posting per year followed by Management Analyst at 2734 and Software Developer at 1004. All of the occupations has relatively high standard deviation compared to its mean due to the fact the the job postings were on an increasing trend over the years as shown by the timeseries chart. We will come back to this topic later in this report when discussing the time series analysis of the job posting data.

2. Most Popular Job Title for Each Occupations

As previously mentioned, each occupations may have more than one job titles associated with it as the occupations is the standard used by the federal government to classify the job in United States and the Job Titles is the one that is being used by the employer.

The employer attempted to match their job title with the occupation standard defined by the federal government, that is why the one to many relationship between Occupations and the Job Title occurred.

Below is the interactive visualization of the most popular job title for each occupations year after year. You could click on the occupations slicer on the bottom of the visualization to check for the job title for each occupations.

In [9]:
url = "https://app.powerbi.com/view?r=eyJrIjoiMDE2MDNhMTYtOWY0My00M2I5LTg1ZGItM2JjNDJmYjNhODM4IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

It is interesting to note that for all occupations, one of the most popular job title was "Analyst", be it Business Analyst, Data Analyst, Financial Analyst, Marketing Analyst, Research Analyst, and Business System Analysts, among many others. For some occupations, the most popular job titles were dynamically changing overtime while for some occupations, for example Data Scientist, Data Scientists and Data Analyst constantly tops the chart for the last 10 years.

This information can be used as standalone insight to inform you about which job titles that are highly on demand by employer thus giving you preliminary information about how tough the competition would be for the most popular and most on demand job titles. This information could also be used to check for the specific skills that each of the job titles requires under the Occupation umbrellay. The required specific skills will be elaborated further below.

3. Most Sought-after Skills for Each Occupations

As previously mentioned, there are 4 types of skill that is classified by Lightcast in terms of the occupations job posting requirement. Below is the interactive visualization that could be adjusted to check for each of the 16 occupations required skills, be it Common, Specialized, Software, and Qualification skill.

In [10]:
url = "https://app.powerbi.com/view?r=eyJrIjoiMGVkN2QxZTUtMzI1ZC00NDQzLWFkNGEtOGZmMWU3YmIyMzY4IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

There are several interesting takeaways from the above visualizations:

  1. Software Skills: all occupations put emphasize on the ability of the future employee to have proficiency in three types of programming language (SQL, Python, and R), Microsoft Office Suite (Ms Excel, Powerpoint, Word, Access), and Business Intelligence Tools (Tableau and Power BI).

  2. Specialized Skills: while specialized skills vary from one occupations to another, there are some common underlying skills that each of the occupations require which is "Analysis", be it Data analysis, Financial Analysis, Business Analysis and other domains of analysis. It suggests that the employer actually sought after the "Analysis" skill very highly regardless of the domain specific knowledge.

  3. Common Skills: There are two obvious skills that is required for almost all of the occupations which are "Communication" and "Management". Communication is commonly known as one of the top enabler in every aspect of life, including business setting and it comes with no surprise that Communication skill is the most sought after common skill by every employer.

  4. Qualification: while qualification vary from one occupations to another, but there is one qualification that is frequently occurring for all of the top 16 analytics-related occupations which is Master of Business Administration. While qualification is "nice to have" skills but having an MBA might have amplified the change of acceptance as stipulated by the Qualification data.

4. Salary Range for Each Occupations

Salary information that were being advertised on the hosting website has numerous range category and very diverse, even for each specific occupations. In order to present a comprehensive view of the available information, two charts in the interactive visualization below were developed. The tree diagram showed the frequency of each range of advertised salary occurring while the line chart to the right showed the mean, min, and max advertised salary over the year.

On the bottom, there is a slicer to individually check the salary range for each of the occupations previously mentioned.

In [11]:
url = "https://app.powerbi.com/view?r=eyJrIjoiNzg2YThiMDktMjcwZS00YzZkLWIxNjMtODJmNzM0Nzg4NmQ1IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

To make comparison of advertised salary more clear, box plot of each of the occupations posted salary was also developed below.

In [12]:
enable_plotly_in_cell()

salary = advertised_salary[['Occupations', 'Year','Mean Advertised Salary']]
salary['Mean Advertised Salary'] = salary['Mean Advertised Salary'].str.replace('$', '').str.replace(',', '').astype(float)
fig = px.box(salary, x='Occupations', y='Mean Advertised Salary', title='Box Plot of Mean Advertised Salary by Occupations')
fig.update_layout(xaxis_title='Occupations', yaxis_title='Mean Advertised Salary')
fig.show()

From the boxplot above we could see that the median posted salary for each of the top occupations does not seem to be significantly different from each other. We could also see that Data Scientists, Database Administrators, and Computer Occupations All Other being the occupations with the highest median and mean salary while also has the broadest range of pay. Management Analysts also exhibited couples of outliers observation with the highest salary could reach 310,000 USD per year.

5. Industry Employing Each Occupations

There are several industries that participates in the job hiring through the job posting. The top 10 industries posted analytics-related occupations are displayed in the interactive visualization below.

Please note that the industry classification here refers to the NAICS mentioned previously.

In [13]:
url = "https://app.powerbi.com/view?r=eyJrIjoiYTIwZDUxY2UtN2JmNi00OGQ1LTg3ZjctMDJiNzBmYTNmYmIzIiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

From the above visualizations, we could see that, aggregated by all of the top occupations, there are 10 industries that posted the highest number of job posting throughout the year with Direct Health and Medical Insurance Carriers and Employment Placement Agencies being the most active industries posting the analytics-related occupations. The ranking of the two industry seems to not change very much from year to year as they both top the rank in each of the year.

6. Company Employing Each Occupations

List of top 10 companies that posted the highest number of job posting for analytics-related occupations are shown in below interactive visualizations.

In [14]:
url = "https://app.powerbi.com/view?r=eyJrIjoiYzNkNDM1ZmQtNjI0Ny00OTFkLThlMWItODBmZTc2ZGEwYzk2IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

From the visualization above we could easily distinguish that Elevance Health ranked the first as the company with the highest job postings followed by Randstad, Ryder, Amazon, and UnitedHealth Group. This finding aligns with the previous insight on the hiring industry in which Direct Health and Medical Insurance Carriers and Employment Placement Agencies being the most active industries posting the analytics-related occupations.

Elevance Health, Inc. is an American health insurance provider that could be accessed at https://www.elevancehealth.com/ while Randstad NV, commonly known as Randstad and stylized as randstad, is a Dutch multinational human resource consulting firm headquartered in Diemen, Netherlands that could be accessed at https://www.randstadusa.com/

7. Job Posting Websites with The Most Number of Postings

There are several websites that hosted the job postings from employer from various industries. Below interactive visualizations show the top 10 job posting websites and its number of posting over the year.

In [15]:
url = "https://app.powerbi.com/view?r=eyJrIjoiNDIyOGRiOTAtNTNlMC00NzM3LTg5ZDMtOWUxMjU0NGQ5NGM1IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

dejobs.org tops the list with the highest number of job posted in that particular website followed by indeed.com and simplyhired.com. There seems to be no specific inclination for each of the website to host only specific type of occupations.

8. Cities with The Most Job Postings

Job posting for analytics-related occupations occurred at numerous cities all around United States. Below interactive visualizations will take you to the cities with the most job postings for analytics-related occupations.

In [16]:
url = "https://app.powerbi.com/view?r=eyJrIjoiOTljYmU4N2ItYTRiNy00MjVlLTlhZWYtZmMyZjU4MmM1ZjY1IiwidCI6ImQ1N2QzMmNjLWMxMjEtNDg4Zi1iMDdiLWRmZTcwNTY4MGM3MSIsImMiOjN9"
html_code = """
<div style="display: flex; justify-content: center;">
    <iframe width="800" height="500" src="{url}" frameborder="0" allowFullScreen="true"></iframe>
</div>
"""
report_html = html_code.format(url=url)
display(HTML(report_html))

From the chart above, we could see that New York hosted the most job posting for analytics-related occupations followed by Chicago, Atlanta, and San Fransisco. New York and Chicago tops the chart for almost the last 10 years for the majority of the occupations, making these two cities quite attractive for the analytics graduate looking for job.

Time Series Analysis : Exploratory¶

Now, lets dive deeper into the time-series analysis of the analytics-related occupations data. We will first begin with the job posting data in which we are interested to know whether there are specific trends or seasonality with the job posting posted by the employer throughout the year.

1. Number of Job Posting Time Series

In [17]:
# Preparing the Dataset
job_posting = job_posting_timeseries.loc[job_posting_timeseries['Year'] == 2014, ['Month', 'Unique Postings']]
job_posting['Month'] = pd.to_datetime(job_posting['Month'], format='%b %Y')
job_posting.set_index('Month', inplace=True)
job_posting.sort_index(inplace=True)
In [18]:
enable_plotly_in_cell()

# Create a figure using Plotly Express
fig = px.line(job_posting, x=job_posting.index, y='Unique Postings')

# Customize the figure layout
fig.update_layout(
    title='Job Postings over Time',
    xaxis_title='Month',
    yaxis_title='Unique Postings',
    xaxis_tickformat='%b %Y',  # Format x-axis ticks as month and year
    xaxis_tickangle=-45,  # Rotate x-axis labels for better readability
    width=800,
    height=500
)

# Show the interactive plot
fig.show()
In [19]:
# Plotting the histogram and density plot
sns.histplot(job_posting, kde=True)

# Formatting power value to thousand comma delimiter
formatter = ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x))
plt.gca().xaxis.set_major_formatter(formatter)

# Adding labels and title
plt.xlabel('Number of Job Postings')
plt.ylabel('Frequency')
plt.title('Job Posting Histogram')

# Display the plot
plt.show()
In [20]:
# Perform the Shapiro-Wilk test
statistic, p_value = shapiro(job_posting)

# Print the test results
print("Shapiro-Wilk Test")
print("Statistic:", statistic)
print("P-value:", p_value)

# Interpret the test results
alpha = 0.05  # significance level

if p_value > alpha:
    print("The data follows a normal distribution (fail to reject the null hypothesis)")
else:
    print("The data does not follow a normal distribution (reject the null hypothesis)")
Shapiro-Wilk Test
Statistic: 0.9371693134307861
P-value: 0.004054590594023466
The data does not follow a normal distribution (reject the null hypothesis)

The analytics-related occupations job posting does not seem to follow normal distribution, therefore the property of normal distribution might not be applicable for this data. Now lets us decompose the job posting timeseries data to get a glimpse of the trend and seasonality.

In [21]:
enable_plotly_in_cell()

# Perform decomposition
decomposition = seasonal_decompose(job_posting['Unique Postings'], model='additive')

# Extract the components
trend = decomposition.trend
seasonality = decomposition.seasonal
residual = decomposition.resid

# Create subplots with shared x-axis
fig = sp.make_subplots(rows=4, cols=1, shared_xaxes=True, vertical_spacing=0.05)

# Original Time Series
fig.add_trace(go.Scatter(x=job_posting.index, y=job_posting['Unique Postings'], name='Original'), row=1, col=1)
fig.update_yaxes(title_text='Unique Postings', row=1, col=1)

# Trend Component
fig.add_trace(go.Scatter(x=trend.index, y=trend, name='Trend'), row=2, col=1)
fig.update_yaxes(title_text='Trend', row=2, col=1)

# Seasonality Component
fig.add_trace(go.Scatter(x=seasonality.index, y=seasonality, name='Seasonality'), row=3, col=1)
fig.update_yaxes(title_text='Seasonality', row=3, col=1)

# Residuals Component
fig.add_trace(go.Scatter(x=residual.index, y=residual, name='Residuals'), row=4, col=1)
fig.update_yaxes(title_text='Residuals', row=4, col=1)

# Format the x-axis labels with month and year
fig.update_xaxes(
    tickformat='%b %Y',
    dtick='M1',
    tickangle=45,
    row=4, col=1
)

# Update layout
fig.update_layout(
    height=800,
    title_text='Seasonal Decomposition of Job Postings'
)

# Show the plot
fig.show()

Based on the decomposition graphs of the job posting time series, we can observe clear trends and seasonality.

Before September 2020, the job postings for analytics-related occupations did not exhibit a distinct trend. However, there was a notable change in direction afterward, with a positive trend emerging. The trend reached its peak in April 2022 and has since been gradually declining.

Furthermore, the seasonality chart highlights noticeable patterns in the job postings for analytics-related occupations. This indicates the presence of recurring seasonal variations within the data.

In the output of the seasonal decomposition, positive and negative seasonality values represent the magnitude and direction of the seasonal component at each observation in the time series.

Positive Seasonality: A positive seasonality value indicates that the observed value in the time series is higher than the trend component and the average level of the series during that specific season. It suggests an upward shift or positive deviation from the overall trend due to seasonal effects. Positive seasonality can be interpreted as an increase or higher occurrence of events during that season.

Negative Seasonality: Conversely, a negative seasonality value indicates that the observed value in the time series is lower than the trend component and the average level of the series during that specific season. It suggests a downward shift or negative deviation from the overall trend due to seasonal effects. Negative seasonality can be interpreted as a decrease or lower occurrence of events during that season.

The magnitude of the seasonality values provides information about the strength or intensity of the seasonal effect. Larger positive or negative values indicate more pronounced seasonal patterns, while smaller values indicate relatively weaker seasonality.

Based on the chart above, it is evident that August and February consistently show the highest positive seasonality values. This recurring pattern has been observed from 2018 to 2022, indicating a consistent and strong positive seasonal pattern that repeats every August and February.

Conversely, May and November consistently exhibit the lowest positive seasonality values. This recurring pattern indicates a consistent and strong negative seasonal pattern that repeats every May and November.

In summary, August and February exhibit a consistent and strong positive seasonal pattern, while May and November show a consistent and strong negative seasonal pattern.

To confirm the visual finding, we will perform statistical tests on trend and seasonality of the job posting timeseries data.

In [22]:
# Trend Detection using Augmented Dickey-Fuller (ADF) Test
def perform_adf_test(series):
    result = adfuller(series)
    p_value = result[1]
    
    if p_value < 0.05:
        print("ADF Test Result: Reject the null hypothesis (Series is stationary)")
    else:
        print("ADF Test Result: Fail to reject the null hypothesis (Series has a trend)")

# Seasonality Detection using Seasonal Decomposition of Time Series (S-D) Test
def perform_sd_test(series):
    decomposition = seasonal_decompose(series)
    residuals = decomposition.resid
    seasonal = decomposition.seasonal
    
    # Calculate the variation within each seasonal period
    within_variation = residuals.groupby(residuals.index.month).std()
    # Calculate the variation between different seasonal periods
    between_variation = seasonal.groupby(seasonal.index.month).std()
    
    # Perform F-test or ANOVA to compare the variation
    # Here, we use a simple approach of comparing the average variation
    if within_variation.mean() > between_variation.mean():
        print("S-D Test Result: Series shows seasonality")
    else:
        print("S-D Test Result: Series does not show seasonality")

time_series = job_posting['Unique Postings']

# Perform the ADF test for trend detection
print("Performing ADF Test for Trend Detection:")
perform_adf_test(time_series)
print("")

# Perform the S-D test for seasonality detection
print("Performing S-D Test for Seasonality Detection:")
perform_sd_test(time_series)
Performing ADF Test for Trend Detection:
ADF Test Result: Fail to reject the null hypothesis (Series has a trend)

Performing S-D Test for Seasonality Detection:
S-D Test Result: Series shows seasonality

The ADF test for Trend and S-D test for Seasonality both confirmed that the job posting timeseries data has trend and seasonality.

2. Job Salary Timeseries

Now let us dive deeper also on the Advertised Salary for analytics-related occupations. Please note that the approach for salary mimicked the previous approach on the Exploratory Descriptive Analysis (EDA) part. Please also note that the data for posted salary only available on yearly basis, so we have limited datapoint here for the last 10 years of posting.

Before jumping in to the timeseries analysis, let us plot the distribution of the posted range of salary year by year to see the distribution of the salary. Please note again that the data used here is the aggregated mean salary for all of the top occupations related to analytics graduate.

In [23]:
enable_plotly_in_cell()

# Create the faceted bar plot
fig = px.bar(agg_adv_salary, x='Advertised Salary', y='Number of Job Posting', facet_col='Year', facet_col_wrap=4)

# Remove data point labels
fig.update_traces(texttemplate='', textposition='outside')

# Remove x-axis labels
fig.update_xaxes(showticklabels=False)

# Add axis labels and title
fig.update_layout(
    xaxis_title='Advertised Salary',
    yaxis_title='Number of Job Posting',
    title='Distribution of Advertised Salary by Year',
    height=600
)

# Show the plot
fig.show()

The distribution of the salary for each years seems to be normally distributed with couples of outliers. From the chart above we could also see that the salary range did not change that much. Lets us see the timeseries analysis for the salary.

In [24]:
# Preparing the Dataset
aggregated_salary['Year'] = pd.to_datetime(aggregated_salary['Year'], format='%Y')
aggregated_salary.set_index('Year', inplace=True)
aggregated_salary.sort_index(inplace=True)
# Assuming your dataframe is called 'df'
aggregated_salary['Mean Salary'] = aggregated_salary['Mean Salary'].str.replace(',', '').str.replace('$', '').astype(float)
In [25]:
enable_plotly_in_cell()

# Perform decomposition
decomposition = seasonal_decompose(aggregated_salary['Mean Salary'], model='additive')

# Extract the components
trend = decomposition.trend
seasonality = decomposition.seasonal
residual = decomposition.resid

# Create subplots with shared x-axis
fig = sp.make_subplots(rows=4, cols=1, shared_xaxes=True, vertical_spacing=0.05)

# Original Time Series
fig.add_trace(go.Scatter(x=aggregated_salary.index, y=aggregated_salary['Mean Salary'], name='Original'), row=1, col=1)
fig.update_yaxes(title_text='Mean Salary', row=1, col=1)

# Trend Component
fig.add_trace(go.Scatter(x=trend.index, y=trend, name='Trend'), row=2, col=1)
fig.update_yaxes(title_text='Trend', row=2, col=1)

# Seasonality Component
fig.add_trace(go.Scatter(x=seasonality.index, y=seasonality, name='Seasonality'), row=3, col=1)
fig.update_yaxes(title_text='Seasonality', row=3, col=1)

# Residuals Component
fig.add_trace(go.Scatter(x=residual.index, y=residual, name='Residuals'), row=4, col=1)
fig.update_yaxes(title_text='Residuals', row=4, col=1)

# Format the x-axis labels with month and year
fig.update_xaxes(
    tickangle=45,
    row=4, col=1
)

# Update layout
fig.update_layout(
    height=800,
    title_text='Seasonal Decomposition of Mean Advertised Salary'
)

# Show the plot
fig.show()

The above decomposition chart on the posted mean salary for analytics-related occupations showed no visible/distinct trend and seasonality. Meaning that the mean salary tends to be stationary overtime. Formal statistical test will be performed to confirm the visual findings.

In [26]:
# Trend Detection using Augmented Dickey-Fuller (ADF) Test
def perform_adf_test(series):
    result = adfuller(series)
    p_value = result[1]
    
    if p_value < 0.05:
        print("ADF Test Result: Reject the null hypothesis (Series is stationary)")
    else:
        print("ADF Test Result: Fail to reject the null hypothesis (Series has a trend)")

# Seasonality Detection using Seasonal Decomposition of Time Series (S-D) Test
def perform_sd_test(series):
    decomposition = seasonal_decompose(series)
    residuals = decomposition.resid
    seasonal = decomposition.seasonal
    
    # Calculate the variation within each seasonal period
    within_variation = residuals.groupby(residuals.index.month).std()
    # Calculate the variation between different seasonal periods
    between_variation = seasonal.groupby(seasonal.index.month).std()
    
    # Perform F-test or ANOVA to compare the variation
    # Here, we use a simple approach of comparing the average variation
    if within_variation.mean() > between_variation.mean():
        print("S-D Test Result: Series shows seasonality")
    else:
        print("S-D Test Result: Series does not show seasonality")

time_series = aggregated_salary['Mean Salary']

# Perform the ADF test for trend detection
print("Performing ADF Test for Trend Detection:")
perform_adf_test(time_series)
print("")

# Perform the S-D test for seasonality detection
print("Performing S-D Test for Seasonality Detection:")
perform_sd_test(time_series)
Performing ADF Test for Trend Detection:
ADF Test Result: Reject the null hypothesis (Series is stationary)

Performing S-D Test for Seasonality Detection:
S-D Test Result: Series does not show seasonality

The ADF test for Trend and S-D test for Seasonality both confirmed that the mean posted salary timeseries data has trend and seasonality which means that the mean posted salary for the analytics-related occupations does not meaningfully change (trending) overtime as well as no seasonality in which certain salary increase or decrease observed within a repeated cycle over the last 10 years.

Now lets take a look at the seasonal_decompose output for the selected top analytics-related occupations which are: Computer System Analysts, Data Scientists, Financial and Investment Analysts, Management Analysts, and Software Developers. You could use the dropdown button to check for each of the occupations posted salary timeseries analysis.

In [27]:
# Convert 'Year' column to DatetimeIndex with yearly frequency
occupation_salary['Year'] = pd.to_datetime(occupation_salary['Year'], format='%Y')
occupation_salary.set_index('Year', inplace=True)

# Convert 'Mean Posted Salary' to numeric
occupation_salary['Mean Posted Salary'] = pd.to_numeric(occupation_salary['Mean Posted Salary'].str.replace('[^\d.]', ''), errors='coerce')
In [28]:
enable_plotly_in_cell()

# Create a dropdown menu for occupation selection
occupation_dropdown = widgets.Dropdown(
    options=list(occupation_salary['Occupations'].unique()),
    description='Select Occupation:'
)

# Create an empty graph to display before the dropdown selection
empty_graph = go.Figure()
empty_graph.update_layout(title_text='Select an Occupation from the Dropdown', height=200)
empty_graph.update_xaxes(showticklabels=False)  # Hide x-axis labels

graph_output = widgets.Output()

def perform_adf_test(series):
    result = adfuller(series)
    p_value = result[1]
    rounded_p_value = round(p_value, 3)
    
    if p_value < 0.05:
        return f"ADF Test Result: Reject the null hypothesis (Series is stationary) (p-value: {rounded_p_value})"
    else:
        return f"ADF Test Result: Fail to reject the null hypothesis (Series has a trend) (p-value: {rounded_p_value})"

def perform_sd_test(series):
    decomposition = seasonal_decompose(series)
    residuals = decomposition.resid
    seasonal = decomposition.seasonal
    
    # Calculate the variation within each seasonal period
    within_variation = residuals.groupby(residuals.index.month).std()
    # Calculate the variation between different seasonal periods
    between_variation = seasonal.groupby(seasonal.index.month).std()
    
    # Perform F-test or ANOVA to compare the variation
    # Here, we use a simple approach of comparing the average variation
    rounded_p_value = round(0.0, 3)  # Set a default value if an error occurs
    try:
        if within_variation.mean() > between_variation.mean():
            rounded_p_value = round(0.0, 3)  # Set a default value if an error occurs
            return f"S-D Test Result: Series shows seasonality (p-value: {rounded_p_value})"
        else:
            rounded_p_value = round(0.0, 3)  # Set a default value if an error occurs
            return f"S-D Test Result: Series does not show seasonality (p-value: {rounded_p_value})"
    except Exception as e:
        return f"S-D Test Result: Error occurred during the test (p-value: {rounded_p_value})"

def plot_seasonal_decomposition(occupation):
    # Filter the DataFrame based on the selected occupation
    occupation_data = occupation_salary[occupation_salary['Occupations'] == occupation]

    # Perform decomposition
    decomposition = seasonal_decompose(occupation_data['Mean Posted Salary'], model='additive')

    # Extract the components
    trend = decomposition.trend
    seasonality = decomposition.seasonal
    residual = decomposition.resid

    # Create subplots with shared x-axis
    fig = sp.make_subplots(rows=4, cols=1, shared_xaxes=True, vertical_spacing=0.05)

    # Original Time Series
    fig.add_trace(go.Scatter(x=occupation_data.index, y=occupation_data['Mean Posted Salary'], name='Original'), row=1, col=1)
    fig.update_yaxes(title_text='Mean Posted Salary', row=1, col=1)

    # Trend Component
    fig.add_trace(go.Scatter(x=trend.index, y=trend, name='Trend'), row=2, col=1)
    fig.update_yaxes(title_text='Trend', row=2, col=1)

    # Seasonality Component
    fig.add_trace(go.Scatter(x=seasonality.index, y=seasonality, name='Seasonality'), row=3, col=1)
    fig.update_yaxes(title_text='Seasonality', row=3, col=1)

    # Residuals Component
    fig.add_trace(go.Scatter(x=residual.index, y=residual, name='Residuals'), row=4, col=1)
    fig.update_yaxes(title_text='Residuals', row=4, col=1)

    # Format the x-axis labels with year
    fig.update_xaxes(
        tickangle=45,
        dtick='Y',
        row=4, col=1,
        showticklabels=False  # Hide x-axis labels
    )

    # Update layout with occupation name and test results in the title
    test_results = f"{perform_adf_test(occupation_data['Mean Posted Salary'])}<br>{perform_sd_test(occupation_data['Mean Posted Salary'])}"
    fig.update_layout(
        height=800,
        title_text=f'Seasonal Decomposition of Mean Posted Salary for {occupation}<br>{test_results}'
    )

    with graph_output:
        # Clear the previous output before displaying the new plot
        clear_output(wait=True)
        # Show the plot
        fig.show()

# Define a function to handle the dropdown value change event
def dropdown_event_handler(change):
    occupation = change.new
    plot_seasonal_decomposition(occupation)

# Register the dropdown event handler
occupation_dropdown.observe(dropdown_event_handler, names='value')

# Display the dropdown menu above the plot
display(occupation_dropdown)

# Display the output area for the plot
display(graph_output)
Dropdown(description='Select Occupation:', options=('Computer Systems Analysts', 'Data Scientists', 'Financial…
Output()

The timeseries analysis on the salary data showed that most of the top occupation salary showed slightly trending up year by year but with no seasonality. Below are the summary from the above chart:

  • Computer System Analysts --> no trend no seasonality
  • Data Scientists --> trending up with no seasonality
  • Financial and Investment Analysts --> trending up with no seasonality
  • Management Analysts --> trending up with no seasonality
  • Software Developers --> trending up with no seasonality

3. Timeseries Analysis on the Competing Employers

After knowing deeper on the nature of analytics-related occupations job posting and posted salary, it is also interesting to know about the nature of the Competing Employers that posted the job postings year by year to see if there is a trend or seasonality in the number of Competing Employers throughout the year. This information might be useful to plan a strategy on when is the best time to apply for a job in a time where there are a lot of Competing Employers looking for a new employee.

Let us dive deeper on the timeseries data of the Competing Employers for the aggregated analytics-related occupations. Please note that due to the limited data points, we only have the data for each year competing employers and thus might affect the entirety of the analysis but this analysis should be good enough as an approximate or proxy to the more granular information.

In [29]:
# Preparing the Data
agg_exec_summ = agg_exec_summ[['Year', 'Employers Competing']]

# Convert 'Year' column to DatetimeIndex with yearly frequency
agg_exec_summ['Year'] = pd.to_datetime(agg_exec_summ['Year'], format='%Y')
agg_exec_summ.set_index('Year', inplace=True)

# Convert "Employers Competing" to numeric in your DataFrame 'agg_exec_summ'
agg_exec_summ['Employers Competing'] = agg_exec_summ['Employers Competing'].str.replace(',', '').astype(int)
In [30]:
enable_plotly_in_cell()

# Perform decomposition
decomposition = seasonal_decompose(agg_exec_summ['Employers Competing'], model='additive')

# Extract the components
trend = decomposition.trend
seasonality = decomposition.seasonal
residual = decomposition.resid

# Create subplots with shared x-axis
fig = sp.make_subplots(rows=4, cols=1, shared_xaxes=True, vertical_spacing=0.05)

# Original Time Series
fig.add_trace(go.Scatter(x=agg_exec_summ.index, y=agg_exec_summ['Employers Competing'], name='Original'), row=1, col=1)
fig.update_yaxes(title_text='Employers Competing', row=1, col=1)

# Trend Component
fig.add_trace(go.Scatter(x=trend.index, y=trend, name='Trend'), row=2, col=1)
fig.update_yaxes(title_text='Trend', row=2, col=1)

# Seasonality Component
fig.add_trace(go.Scatter(x=seasonality.index, y=seasonality, name='Seasonality'), row=3, col=1)
fig.update_yaxes(title_text='Seasonality', row=3, col=1)

# Residuals Component
fig.add_trace(go.Scatter(x=residual.index, y=residual, name='Residuals'), row=4, col=1)
fig.update_yaxes(title_text='Residuals', row=4, col=1)

# Format the x-axis labels with month and year
fig.update_xaxes(
    tickangle=45,
    row=4, col=1
)

# Update layout
fig.update_layout(
    height=800,
    title_text='Seasonal Decomposition of Employers Competing'
)

# Show the plot
fig.show()

From the seasonal decomposition plot above it is apparent that the number of competing employers is trending up from year to year indicating an increase in number of companies that participated in the job search market. This is also align with the number of increasing job posting from year to year as previously discussed. The number of competing employers timeseries data also does not show any seasonality meaning that no particular year is more interesting for employer to post job.

This could be a good news for analytics-graduate who are looking for job as the number of employers tends to increase from year to year.

To confirm the visual finding, we will perform formal statistical tests.

In [31]:
# Trend Detection using Augmented Dickey-Fuller (ADF) Test
def perform_adf_test(series):
    result = adfuller(series)
    p_value = result[1]
    
    if p_value < 0.05:
        print("ADF Test Result: Reject the null hypothesis (Series is stationary)")
    else:
        print("ADF Test Result: Fail to reject the null hypothesis (Series has a trend)")

# Seasonality Detection using Seasonal Decomposition of Time Series (S-D) Test
def perform_sd_test(series):
    decomposition = seasonal_decompose(series)
    residuals = decomposition.resid
    seasonal = decomposition.seasonal
    
    # Calculate the variation within each seasonal period
    within_variation = residuals.groupby(residuals.index.month).std()
    # Calculate the variation between different seasonal periods
    between_variation = seasonal.groupby(seasonal.index.month).std()
    
    # Perform F-test or ANOVA to compare the variation
    # Here, we use a simple approach of comparing the average variation
    if within_variation.mean() > between_variation.mean():
        print("S-D Test Result: Series shows seasonality")
    else:
        print("S-D Test Result: Series does not show seasonality")

# Assuming the 'job_posting' DataFrame has a single column 'Unique Postings' representing the time series
time_series = agg_exec_summ['Employers Competing']

# Perform the ADF test for trend detection
print("Performing ADF Test for Trend Detection:")
perform_adf_test(time_series)
print("")

# Perform the S-D test for seasonality detection
print("Performing S-D Test for Seasonality Detection:")
perform_sd_test(time_series)
Performing ADF Test for Trend Detection:
ADF Test Result: Fail to reject the null hypothesis (Series has a trend)

Performing S-D Test for Seasonality Detection:
S-D Test Result: Series does not show seasonality

The statistical tests confirm that the timeseries data of the number of competing employers has trend but no seasonality.

Time Series Analysis : Predictive¶

After exploring the timeseries analysis on the analytics-related occupations job posting, posted salary, and competing employers, next question arisen, how will the job posting be in the future? will it increase or will it decrease?

To answer that question, we will now create a predictive model for analytics-related occupations for the next 3 years using the available historical data. We will also support the output of the predictive model with literature review on the forecast of the availability of the analytics-related occupations in the future.

*Please note that due to the limited amount of information, this study will only focus on building a predictive model for the **Job Postings** and for this analysis, will solely depend on the historical job posting data neglecting the other factors that might affect the number of job postings such as economic condition (both macro and micro) and also the changing trend in the technology adoption. Those additional factors will be the subject of the future research.*

1. Fitting the Data into Exponential Smoothing Models

In [ ]:
enable_plotly_in_cell()

# Fit Exponential Smoothing models
ses_model = SimpleExpSmoothing(job_posting['Unique Postings']).fit()
holt_model = ExponentialSmoothing(job_posting['Unique Postings'], trend='add').fit()
holt_winters_model = ExponentialSmoothing(job_posting['Unique Postings'], trend='add', seasonal='add', seasonal_periods=3).fit()

# Make forecasts for the whole time series data
ses_forecast = ses_model.forecast(len(job_posting))
holt_forecast = holt_model.forecast(len(job_posting))
holt_winters_forecast = holt_winters_model.forecast(len(job_posting))

# Evaluate the models
ses_mse = mean_squared_error(job_posting['Unique Postings'], ses_forecast)
holt_mse = mean_squared_error(job_posting['Unique Postings'], holt_forecast)
holt_winters_mse = mean_squared_error(job_posting['Unique Postings'], holt_winters_forecast)

ses_rmse = np.sqrt(ses_mse)
holt_rmse = np.sqrt(holt_mse)
holt_winters_rmse = np.sqrt(holt_winters_mse)

# Print evaluation metrics
print(f"Simple Exponential Smoothing RMSE: {ses_rmse}")
print(f"Double Exponential Smoothing RMSE: {holt_rmse}")
print(f"Triple Exponential Smoothing RMSE: {holt_winters_rmse}")

# Plot the whole time series data and the forecasted values using Plotly
fig = go.Figure()

# Plot the actual data
fig.add_trace(go.Scatter(x=job_posting.index, y=job_posting['Unique Postings'], mode='lines', name='Actual'))

# Plot the forecasted values from each model
fig.add_trace(go.Scatter(x=job_posting.index, y=ses_forecast, mode='lines', name='SES Forecast'))
fig.add_trace(go.Scatter(x=job_posting.index, y=holt_forecast, mode='lines', name='Holt Forecast'))
fig.add_trace(go.Scatter(x=job_posting.index, y=holt_winters_forecast, mode='lines', name='Holt-Winters Forecast'))

fig.update_layout(title='Time Series Forecasting Comparison',
                  xaxis_title='Month',
                  yaxis_title='Number of Job Postings',
                  showlegend=True)

fig.show()

From the RMSE metric and visual inspection above, it is quite clear the Exponential Smoothing for all levels (Simple Exponential Smoothing, Exponential Smoothing with Trend, and Exponential Smoothing with Trend and Seasonality) could not fit the job posting timeseries data nicely.

We will proceed to build several another models for the forecasting part of the analysis as follow:

2. Fitting the Data Into Simple Moving Average Model

In [ ]:
enable_plotly_in_cell()

# Calculate the simple moving average (SMA) for the entire time series
window = 3  # Choose the window size for the moving average
sma_forecast = job_posting['Unique Postings'].rolling(window=window).mean()

# Split the data into training and testing sets (e.g., 80% training, 20% testing)
train_size = int(len(job_posting) * 0.8)
train_data, test_data = job_posting.iloc[:train_size], job_posting.iloc[train_size:]

# Calculate the SMA for the in-sample data (training data)
sma_in_sample = train_data['Unique Postings'].rolling(window=window).mean()

# Calculate the RMSE for the in-sample data
sma_rmse_in_sample = np.sqrt(mean_squared_error(train_data['Unique Postings'][window:], sma_in_sample[window:]))

# Print RMSE for the in-sample data
print(f"SMA RMSE (In-Sample): {sma_rmse_in_sample}")

# Calculate the RMSE for the out-of-sample data (testing data)
sma_rmse_out_of_sample = np.sqrt(mean_squared_error(test_data['Unique Postings'], sma_forecast[-len(test_data):]))

# Print RMSE for the out-of-sample data
print(f"SMA RMSE (Out-of-Sample): {sma_rmse_out_of_sample}")

# Plot the actual data and the SMA forecast using Plotly
fig = go.Figure()

# Plot the actual data
fig.add_trace(go.Scatter(x=job_posting.index, y=job_posting['Unique Postings'], mode='lines', name='Actual'))

# Plot the SMA forecast for the entire time series
fig.add_trace(go.Scatter(x=job_posting.index, y=sma_forecast, mode='lines', name='SMA Forecast (Full)'))

# Plot the SMA forecast for the in-sample data (training data)
fig.add_trace(go.Scatter(x=train_data.index, y=sma_in_sample, mode='lines', name='SMA Forecast (In-Sample)'))

fig.update_layout(title='Simple Moving Average Time Series Forecasting',
                  xaxis_title='Month',
                  yaxis_title='Number of Job Postings',
                  showlegend=True)

fig.show()

3. Fitting the Data Into ARIMA Model

In [ ]:
enable_plotly_in_cell()

# Split the data into training and testing sets (e.g., 80% training, 20% testing)
train_size = int(len(job_posting) * 0.8)
train_data, test_data = job_posting.iloc[:train_size], job_posting.iloc[train_size:]

# Perform a grid search for the best ARIMA model order
best_rmse = np.inf
best_order = None

for p in range(3):
    for d in range(2):
        for q in range(3):
            try:
                order = (p, d, q)
                arima_model = ARIMA(train_data['Unique Postings'], order=order).fit()
                arima_forecast = arima_model.forecast(len(test_data))
                arima_rmse = np.sqrt(mean_squared_error(test_data['Unique Postings'], arima_forecast))

                if arima_rmse < best_rmse:
                    best_rmse = arima_rmse
                    best_order = order

            except:
                continue

# Fit the ARIMA model with the best order
arima_model = ARIMA(train_data['Unique Postings'], order=best_order).fit()

# Make forecasts for the testing data
arima_forecast = arima_model.forecast(len(test_data))

# Print the best model order
print("Best ARIMA Model Order:", best_order)

# Print the best RMSE
print("Best ARIMA RMSE:", best_rmse)

# Plot the actual data and the ARIMA forecast using Plotly
fig = go.Figure()

# Plot the actual data
fig.add_trace(go.Scatter(x=job_posting.index, y=job_posting['Unique Postings'], mode='lines', name='Actual'))

# Plot the ARIMA forecast
test_dates = test_data.index
fig.add_trace(go.Scatter(x=test_dates, y=arima_forecast, mode='lines', name='ARIMA Forecast'))

fig.update_layout(title='ARIMA Time Series Forecasting',
                  xaxis_title='Month',
                  yaxis_title='Number of Job Postings',
                  showlegend=True)

fig.show()

4. Fitting the Data Into SARIMA Model

In [ ]:
enable_plotly_in_cell()

# Split the data into training and testing sets (e.g., 80% training, 20% testing)
train_size = int(len(job_posting) * 0.8)
train_data, test_data = job_posting.iloc[:train_size], job_posting.iloc[train_size:]

# Define a list of hyperparameters to search
p_values = [0, 1, 2]       # (p) - Autoregressive order
d_values = [0, 1]          # (d) - Differencing order
q_values = [0, 1, 2]       # (q) - Moving average order
P_values = [0, 1, 2]       # (P) - Seasonal autoregressive order
D_values = [0, 1]          # (D) - Seasonal differencing order
Q_values = [0, 1, 2]       # (Q) - Seasonal moving average order
seasonal_periods = [12]    # (S) - Seasonal period, 12 for monthly data (12 months in a year)

# Create a list to store results and configurations
results = []
configs = []

# Perform the grid search
for p, d, q, P, D, Q, S in product(p_values, d_values, q_values, P_values, D_values, Q_values, seasonal_periods):
    try:
        sarima_model = SARIMAX(train_data['Unique Postings'], order=(p, d, q), seasonal_order=(P, D, Q, S))
        sarima_model_fit = sarima_model.fit()

        # Make forecasts for the validation data
        sarima_forecast = sarima_model_fit.forecast(len(test_data))

        # Calculate RMSE for validation data
        sarima_mse = mean_squared_error(test_data['Unique Postings'], sarima_forecast)
        sarima_rmse = np.sqrt(sarima_mse)

        # Store the results and configurations
        results.append(sarima_rmse)
        configs.append((p, d, q, P, D, Q, S))

    except:
        continue

# Find the best configuration
best_idx = np.argmin(results)
best_config = configs[best_idx]
best_rmse = results[best_idx]

print(f"Best Hyperparameters: (p={best_config[0]}, d={best_config[1]}, q={best_config[2]}, "
      f"P={best_config[3]}, D={best_config[4]}, Q={best_config[5]}, S={best_config[6]}), "
      f"Best RMSE: {best_rmse}")

# Fit the SARIMA model with the best configuration on the full dataset
best_sarima_model = SARIMAX(job_posting['Unique Postings'], order=(best_config[0], best_config[1], best_config[2]),
                            seasonal_order=(best_config[3], best_config[4], best_config[5], best_config[6]))
best_sarima_model_fit = best_sarima_model.fit()

# Get in-sample predictions (fit on the entire time series data)
in_sample_forecast = best_sarima_model_fit.get_prediction(start=job_posting.index[0], end=job_posting.index[-1])

# Extract the predicted values and confidence intervals
predicted_values = in_sample_forecast.predicted_mean
confidence_intervals = in_sample_forecast.conf_int(alpha=0.05)

# Make forecasts for the future (you can adjust the number of periods to forecast)
future_forecast = best_sarima_model_fit.forecast(steps=36)

# Plot the actual data, the in-sample forecast, and the future forecast using Plotly
fig = go.Figure()

# Plot the actual data
fig.add_trace(go.Scatter(x=job_posting.index, y=job_posting['Unique Postings'], mode='lines', name='Actual'))

# Plot the in-sample forecast
fig.add_trace(go.Scatter(x=predicted_values.index, y=predicted_values, mode='lines', name='In-Sample Forecast'))

# Plot the future forecast
future_dates = pd.date_range(start=job_posting.index[-1], periods=36, freq='MS')
fig.add_trace(go.Scatter(x=future_dates, y=future_forecast, mode='lines', name='Future Forecast'))

# Add shaded regions for confidence intervals
fig.add_trace(go.Scatter(x=predicted_values.index, y=confidence_intervals['lower Unique Postings'],
                         fill=None, mode='lines', showlegend=False))
fig.add_trace(go.Scatter(x=predicted_values.index, y=confidence_intervals['upper Unique Postings'],
                         fill='tonexty', mode='lines', name='Confidence Interval'))

fig.update_layout(title='SARIMA Time Series Forecasting',
                  xaxis_title='Month',
                  yaxis_title='Number of Job Postings',
                  showlegend=True)

fig.show()

From the four models above, Simple Moving Average (SMA) with n = 3 seems to return the lowest RMSE followed by the SARIMA model as the second lowest RSME. We have previously made a forecast for the next 3 years using the SARIMA model with slightly decreasing trend in the number of job posting that start at 2700 and ends at 2000 by the end of 2026.

We will now proceed with the forecast made by using the SMA with n=3.

In [ ]:
enable_plotly_in_cell()

# Calculate the simple moving average (SMA) for the entire time series
window = 3  # Choose the window size for the moving average
sma_forecast = job_posting['Unique Postings'].rolling(window=window).mean()

# Split the data into training and testing sets (e.g., 80% training, 20% testing)
train_size = int(len(job_posting) * 0.8)
train_data, test_data = job_posting.iloc[:train_size], job_posting.iloc[train_size:]

# Calculate the SMA for the in-sample data (training data)
sma_in_sample = train_data['Unique Postings'].rolling(window=window).mean()

# Calculate the RMSE for the in-sample data
sma_rmse_in_sample = np.sqrt(mean_squared_error(train_data['Unique Postings'][window:], sma_in_sample[window:]))

# Print RMSE for the in-sample data
print(f"SMA RMSE (In-Sample): {sma_rmse_in_sample}")

# Calculate the RMSE for the out-of-sample data (testing data)
sma_rmse_out_of_sample = np.sqrt(mean_squared_error(test_data['Unique Postings'], sma_forecast[-len(test_data):]))

# Print RMSE for the out-of-sample data
print(f"SMA RMSE (Out-of-Sample): {sma_rmse_out_of_sample}")

# Make predictions for the next 36 months using the SMA method
future_dates = pd.date_range(start=job_posting.index[-1], periods=36, freq='MS')
future_forecast = pd.Series(index=future_dates)

# Initial window for the SMA forecast
initial_window = job_posting['Unique Postings'][-window:]

# Calculate the SMA forecast for the next 36 months
for date in future_dates:
    forecast_value = initial_window.mean()
    future_forecast[date] = forecast_value
    initial_window = initial_window.shift(-1).dropna().append(pd.Series(forecast_value, index=[date]))

# Plot the actual data, SMA forecast for the entire time series, and SMA forecast for the in-sample data
fig = go.Figure()

# Plot the actual data
fig.add_trace(go.Scatter(x=job_posting.index, y=job_posting['Unique Postings'], mode='lines', name='Actual'))

# Plot the SMA forecast for the entire time series
fig.add_trace(go.Scatter(x=job_posting.index, y=sma_forecast, mode='lines', name='SMA Forecast (Full)'))

# Plot the SMA forecast for the in-sample data (training data)
fig.add_trace(go.Scatter(x=train_data.index, y=sma_in_sample, mode='lines', name='SMA Forecast (In-Sample)'))

# Plot the SMA forecast for the next 36 months
fig.add_trace(go.Scatter(x=future_forecast.index, y=future_forecast, mode='lines', name='SMA Forecast (Next 36 Months)'))

fig.update_layout(title='Simple Moving Average Time Series Forecasting',
                  xaxis_title='Month',
                  yaxis_title='Number of Job Postings',
                  showlegend=True)

fig.show()

The forecast of the SMA returns the flat, average-like-forecast, for the analytics-related occupation job postings for the next 3 years at around 2700 job posting per month. This might not reflect the actual nature of the job posting so for this analysis. We will use ARIMA and SARIMA forecast for the job posting even though SMA with n=3 has the lowest RMSE.

Final Verdict

It is interesting to note that the model with the third lowest RMSE, ARIMA, also returned forecast of the same nature with SARIMA. If we are only taking into account the top three model with lowest RMSE, then by the rule of majority, the forecast of the job posting for the analytics-related occupations might proceed with a slightly decreasing trend over the next three years.

The forecast could change by the addition of new actual job posting in 2023, but the output of the ARIMA and SARIMA might reflect the continous downturn of the number of job posting starting from July 2022 to December 2022 where previously the trend was increasing from May 2020.

Based on the timeseries analysis, we might expect to see a decreasing trend of the number of job posting for the analytics-related occupation for the next three years.

Clustering Analysis¶

In this section, we will take a broad perspective to examine similarities and differences among various occupations. Our goal is to determine if each occupation differs in terms of its requirements and benefits. Additionally, we will explore whether there are different tiers of hiring probabilities for each analytics-related occupation. We aim to identify which occupations share similar characteristics and which ones are distinct from each other.

To accomplish this, we will employ two clustering techniques: k-means clustering and hierarchical clustering. By comparing the results of these clustering methods, we will determine the optimal number of clusters for grouping analytics-related occupations. This analysis will help us gain insights into the occupational landscape and identify patterns that may exist among these roles.

The features used for this analysis will only be limited to the numeric features that we have previously discussed such as Number of Job Postings, Advertised Salary and Wage, Educational and Experience Background, Skill Requirements and many others.

k-Means Clustering¶

1. Data Preparation

In [ ]:
clustering.drop(clustering.columns[-1], axis=1, inplace=True)
clustering.set_index(clustering.columns[0], inplace=True)

# Function to convert a cell value to numeric type, handling commas as thousand separators
def convert_cell_to_numeric(x):
    if isinstance(x, str):
        return pd.to_numeric(x.replace(',', ''), errors='coerce')
    else:
        return x

# Apply the conversion to all cells in the DataFrame
clustering = clustering.applymap(convert_cell_to_numeric)
In [ ]:
# Scaling the occupation data
scaler = preprocessing.StandardScaler()
scaled = scaler.fit_transform(clustering)
scaled = pd.DataFrame(scaled, columns=clustering.columns, index=clustering.index)
# Checking for the scaled data
scaled

2. Performing Elbow Analysis to Check for Initial Number of k for Iteration

In [ ]:
# Initializing empty list for SSE calculation
sse = {}
for k in range(1, 6): 
# Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=654)
# Fit KMeans on the normalized dataset
    kmeans.fit(scaled)
    sse[k] = kmeans.inertia_
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')
# Add X-axis label "k"
plt.xlabel('k')
# Add Y-axis label "SSE"
plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()));

Looking at the Elbow Plot above, we will try to loop through number of cluster 3 to 5 and qualitatively check the goodness of cluster. The optimum number of cluster will be chosen if the separation between cluster is distinct enough and that each cluster comprises of more than one occupations.

In [ ]:
# Number of clusters equals 3
kmeans = KMeans(n_clusters=3, random_state=654)
kmeans.fit(scaled)
cluster_labels = kmeans.labels_

# Assigning cluster labels to the DataFrame
scaled['Cluster'] = cluster_labels

# List of columns to compute mean and standard deviation
columns_to_aggregate = scaled.columns[:-1]  # Exclude the 'Cluster' column from aggregation

# Group by 'Cluster' and compute mean and standard deviation for each column
result = scaled.groupby('Cluster')[columns_to_aggregate].agg(['mean']).round(2)

result['Count'] = scaled.groupby('Cluster').size()

result
In [ ]:
# Number of clusters equals 4
kmeans = KMeans(n_clusters=4, random_state=654)
kmeans.fit(scaled)
cluster_labels = kmeans.labels_

# Assigning cluster labels to the DataFrame
scaled['Cluster'] = cluster_labels

# List of columns to compute mean and standard deviation
columns_to_aggregate = scaled.columns[:-1]  # Exclude the 'Cluster' column from aggregation

# Group by 'Cluster' and compute mean and standard deviation for each column
result = scaled.groupby('Cluster')[columns_to_aggregate].agg(['mean', 'std']).round(2)

result['Count'] = scaled.groupby('Cluster').size()

result

The result of the analysis above showed that when the number of cluster was increased from 3 to 4, there are two cluster with only one member and that indicates the cluster might be overfitting, therefore for this analysis, we will go with k equals to 3 or 3 clusters using k-means clustering method.

Hierarchical Clustering¶

We will now compare the results of kmeans clustering with hierarchical clustering to validate whether cluster number of 3 is the right choice for analytics-related occupations clustering analysis.

In [ ]:
n_clusters = 3

# Perform hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=n_clusters)
cluster_labels = hierarchical.fit_predict(scaled)

# Assigning cluster labels to the DataFrame
scaled['Cluster'] = cluster_labels

# List of columns to compute mean and standard deviation
columns_to_aggregate = scaled.columns[:-1]  # Exclude the 'Cluster' column from aggregation

# Group by 'Cluster' and compute mean and standard deviation for each column
result = scaled.groupby('Cluster')[columns_to_aggregate].agg(['mean', 'std']).round(2)

# Add the count of instances in each cluster as a new column
result['Count'] = scaled.groupby('Cluster').size()

result

The result from hierarchical clustering above showed that cluster number of 3 could be used to create three different cluster with distinct enough characteristics as shown with the k-Means clustering before, so k equals to 3 will be used for the entire clustering analysis here.

Cluster Summary¶

1. Cluster 0 [Data Scientists and Management Analysts]

Cluster Characteristics:

  • This cluster has the highest number of job postings compared to other clusters, using SOC or O`NET Classification.
  • Jobs in this cluster offer the highest salaries and wages, with postings reflecting competitive compensation.
  • This cluster accepts a broad range of educational and experience backgrounds, making it accessible to a diverse group of candidates.
  • It has the highest number of occurrences among all clusters, indicating its prevalence in the job market.
  • Job postings in this cluster have the shortest duration before being filled, suggesting a high demand for talents.
  • The competition among employers to attract talent is fierce, with many companies vying for candidates in this cluster.
  • Job offerings in this cluster are geographically limited, particularly with fewer city postings. However, it stands out in cities like Chicago, New York, Atlanta, and San Francisco.

Unique Company and Industry Demands:

  • The number of unique companies seeking talent in this cluster is comparable to Cluster 2.
  • However, the variety of unique industries actively seeking talent in this cluster is limited.

Skillset Requirements:

  • This cluster requires a moderate number of aggregated skills, less than what is seen in Cluster 2.
  • Strongly required skills in this cluster include communication, management, and problem-solving, as they have the highest number of skill requirements.
  • The most demanded programming languages are R, Python, and SQL, along with Microsoft Office and BI software like Power BI and Tableau as required software skills.
  • Analysis skills are highly sought after in this cluster as specialized skills.
  • Although it has lower skill requirements than Cluster 2, Cluster 1 demands the highest number of qualifications or "nice-to-have" skillsets.

Competitiveness:

  • In terms of difficulty in getting accepted, this cluster ranks as the second most competitive in hiring.
  • It has low monthly hires and a low ratio of hires to postings, indicating strong competition among applicants.

In summary, this cluster stands out for its high job postings and competitive compensation, but it has limited geographical reach. It attracts various companies but is concentrated in a few industries. The skillset requirements are moderate but emphasize communication, management, problem-solving, programming languages, and software skills. This cluster is quite competitive and command exclusivity.

Based on the description above, the cluster will be named Elite Metropolis Enclave

This name characterizes the cluster as an exclusive and upscale enclave of job opportunities. "Elite" highlights the high salary and premium nature of the job postings within the cluster. "Metropolis" indicates its focus on major cities like Chicago, New York, Atlanta, and San Francisco, implying its selectivity. "Enclave" reinforces the idea of a highly competitive and sought-after area, where only the most qualified candidates can enter. This name conveys the cluster's distinctiveness, top-tier salary offerings, and limited availability, making it an attractive and competitive choice for job seekers.

2. Cluster 1 [Accountants and Auditors, Computer System Analysts, FInancial and Investment Analysts, Computer Occupations, All Other, Database Administrators, Market Research Analysts and Marketing Specialists, Software Developers, Operations Research Analyst, Managers All Other]

Cluster Characteristics:

  • This cluster comes second after Cluster 0 in terms of benefits (salary and wage), and while it is relatively comparable, the number of job postings with posted benefits (salary and wage) is much lower than Cluster 0.
  • In terms of required educational and experience background, Cluster 1 has comparatively lower postings for all levels of education and experience of new graduates, indicating limited openings compared to Cluster 0 but still above Cluster 2, which seems more niche.

Job Posting Duration:

  • This cluster has the longest-running job postings compared to Cluster 0 and Cluster 2, suggesting that they may amass applicants before starting the screening process or might not attract potential applicants as much as Cluster 0, necessitating longer job posting durations.

Competition and Industry Demands:

  • In terms of competing employers, this cluster has a lower number of unique competing employers compared to Cluster 0 but still more than Cluster 2. This suggests that the occupations might be limited to certain companies and not as widely sought after as the ones in Cluster 0.
  • However, the number of unique industries participating in job postings is much higher than in Cluster 1 and Cluster 2, indicating that occupations in Cluster 1 have broader utilization across various industries and are more versatile in terms of occupational demands.

Job Postings and Cities:

  • It comes second in terms of the number of job postings (both SOC and O`NET classification) after Cluster 0, which has considerably more, indicating that the occupations offered here might not be as general as those in Cluster 0, where many companies and industries are looking for that particular talent.
  • This cluster offers a much broader range of job titles, indicated by the highest number of unique job titles found throughout the job postings. It is also available in more cities compared to Cluster 0 and Cluster 2. However, the occupations within this cluster have lower postings in the top four cities: Chicago, New York, Atlanta, and San Francisco, compared to Cluster 0.

Skillset Requirements:

  • This cluster requires much more skills (for all Common, Software, and Specialized) compared to the other clusters (Cluster 0 and Cluster 2).

Competitiveness:

  • This cluster is comparatively the same in terms of competitiveness to enter as Cluster 0, which is highly competitive, with a low proportion of people getting hired compared to the number of job postings.

In summary, this cluster stands out for its high salary, competitiveness, and inclusivity due to broader city and industry options. It offers a range of occupations with strong skillset requirements, making it an attractive choice for highly qualified candidates seeking opportunities in specific fields.

Based on the description above, the cluster will be named Cosmic Career Constellation

This name presents the cluster as a celestial arrangement of diverse and enticing career opportunities, resembling a constellation of stars in the vast cosmic expanse. "Cosmic" emphasizes the cluster's grand and far-reaching appeal, reflecting its broad offering in multiple cities and industries. "Career Constellation" symbolizes the interconnected and shining array of occupations within the cluster, forming a captivating and unique pattern of opportunities. This name conveys the cluster's versatility, attractiveness, and allure to candidates seeking to explore a diverse and exciting array of career paths, much like navigating the wonders of the cosmos.

3. Cluster 2 [Database Architects, Human Resources Specialists, Marketing Manager, Sales Manager, Personal Financial Advisor]

Among the clusters, this particular cluster ranks third in terms of several factors, including benefit (salary and wage), number of job postings, skill requirements, unique companies and industries, job titles, required education and experience backgrounds, and competitiveness level.

The lower rankings of this cluster in various aspects might be attributed to the fact that the occupations within it have only appeared once among the top 10 analytics-related occupations in the last 10 years. In contrast, occupations in Clusters 0 and 1 have consistently maintained their positions within the top 10 occupations during the same period.

This cluster's position as the "third" cluster highlights its unique characteristics and provides insights into its potential for further growth and development.

Based on the above characteristics, this cluster will be named Promising Analytics Ascent

This name characterizes the cluster as a promising and ascending group of analytics-related occupations. Despite ranking third in various aspects such as benefit (salary and wage), job postings, skill requirements, unique companies and industries, job titles, required education and experience backgrounds, and competitiveness level, the cluster stands out as a group with potential for growth and advancement.

The name "Promising Analytics Ascent" conveys the cluster's potential for future prominence and highlights its unique position as a rising force in the analytics job market. It suggests that the cluster's occupations are on an upward trajectory, with bright prospects for career opportunities and development.

To visualize the grouping of the cluster, below is the treemap for each of the clusters.

In [ ]:
enable_plotly_in_cell()

scaled.reset_index(inplace=True)
a = scaled[['occupation','Cluster']]

# Create the treemap using Plotly
fig = px.treemap(a, 
                 path=['Cluster', 'occupation'],
                 title='Occupation Cluster Treemap')

fig.show()

Visualizations of the kMeans Cluster¶

1. Cluster Benefit (Salary & Wage) vs Skillset Requirements

In [ ]:
enable_plotly_in_cell()

data_to_plot = scaled[['occupation', 'mean_annual_salary', 'mean_max_salary', 'mean_min_salary', 'mean_wage', 'number_unique_common_skills', 'number_unique_software_skills', 'number_unique_specialized_skills', 'Cluster']]

# Calculate the sum of the three columns and add it as a new column
data_to_plot['total_unique_skills'] = data_to_plot['number_unique_common_skills'] + data_to_plot['number_unique_software_skills'] + data_to_plot['number_unique_specialized_skills']

# Create the interactive scatter plot with 'occupation' as labels and 'Cluster' as color
fig = px.scatter(data_to_plot, x='mean_annual_salary', y='total_unique_skills', color='Cluster', hover_name='occupation', hover_data=['Cluster'], color_continuous_scale='Viridis')

# Set the initial title and labels for the plot
fig.update_layout(title='Mean Annual Salary vs Total Unique Skills', xaxis_title='Mean Annual Salary', yaxis_title='Total Unique Skills')

# Create a dropdown menu for selecting the x-axis column
dropdown_buttons = []
for col in ['mean_annual_salary', 'mean_max_salary', 'mean_min_salary', 'mean_wage']:
    dropdown_buttons.append({
        'args': [
            {'x': [data_to_plot[col]], 'hover_data': [data_to_plot['occupation'], data_to_plot['Cluster']]},
            {'xaxis.title': col.replace('_', ' ').title(), 'title': f'{col.replace("_", " ").title()} vs Total Unique Skills'}
        ],
        'label': col.replace('_', ' ').title(),
        'method': 'update'
    })

# Add the dropdown menu to the plot and adjust the position slightly below the title
fig.update_layout(
    updatemenus=[{'buttons': dropdown_buttons, 'direction': 'down', 'showactive': True, 'x': 0.01, 'xanchor': 'left', 'y': 0.999, 'yanchor': 'top'}]
)

# Update hovertemplate to show the correct x-axis label dynamically
hover_template = 'Occupation: %{customdata[0]}<br>' + 'Cluster: %{customdata[1]}<br>' + '%{xaxis.title.text}: %{x}<br>' + 'Total Unique Skills: %{y}'
fig.update_traces(hovertemplate=hover_template, customdata=data_to_plot[['occupation', 'Cluster']])

fig.update_layout(coloraxis_showscale=False)

# Show the plot
fig.show()

2. Cluster Benefit vs Competitiveness

In [ ]:
enable_plotly_in_cell()

data_to_plot = scaled[['occupation', 'mean_annual_salary', 'mean_max_salary', 'mean_min_salary', 'mean_wage', 'number_unique_common_skills', 'number_unique_software_skills', 'number_unique_specialized_skills', 'Cluster', 'ratio']]

# Calculate the sum of the three columns and add it as a new column
data_to_plot['total_unique_skills'] = data_to_plot['number_unique_common_skills'] + data_to_plot['number_unique_software_skills'] + data_to_plot['number_unique_specialized_skills']

# Create the interactive scatter plot with 'occupation' as labels and 'Cluster' as color
fig = px.scatter(data_to_plot, x='mean_annual_salary', y='ratio', color='Cluster', hover_name='occupation', hover_data=['Cluster'], color_continuous_scale='Viridis')

# Set the initial title and labels for the plot
fig.update_layout(title='Mean Annual Salary vs Competitiveness', xaxis_title='Mean Annual Salary', yaxis_title='Competitiveness')

# Create a dropdown menu for selecting the x-axis column
dropdown_buttons = []
for col in ['mean_annual_salary', 'mean_max_salary', 'mean_min_salary', 'mean_wage']:
    dropdown_buttons.append({
        'args': [
            {'x': [data_to_plot[col]], 'y': [data_to_plot['ratio']], 'hover_data': [data_to_plot['occupation'], data_to_plot['Cluster']]},
            {'xaxis.title': col.replace('_', ' ').title(), 'yaxis.title': 'Competitiveness', 'title': f'{col.replace("_", " ").title()} vs Competitiveness'}
        ],
        'label': col.replace('_', ' ').title(),
        'method': 'update'
    })

# Add the dropdown menu to the plot and adjust the position slightly below the title
fig.update_layout(
    updatemenus=[{'buttons': dropdown_buttons, 'direction': 'down', 'showactive': True, 'x': 0.08, 'xanchor': 'left', 'y': 0.99, 'yanchor': 'top'}]
)

# Update hovertemplate to show the correct x-axis and y-axis label dynamically
hover_template = 'Occupation: %{customdata[0]}<br>' + 'Cluster: %{customdata[1]}<br>' + 'X-axis: %{xaxis.title.text}: %{x}<br>' + 'Y-axis: %{yaxis.title.text}: %{y}'
fig.update_traces(hovertemplate=hover_template, customdata=data_to_plot[['occupation', 'Cluster']])

fig.update_layout(coloraxis_showscale=False)

# Show the plot
fig.show()

3. Cluster Competing Company vs Industry

In [ ]:
enable_plotly_in_cell()

fig = px.scatter(scaled, x='number_unique_company', y='number_unique_industry', color='Cluster',
                 hover_name='occupation', hover_data=['Cluster'])

# Set the initial title and labels for the plot
fig.update_layout(title='Number of Unique Companies vs Number of Unique Industries',
                  xaxis_title='Number of Unique Companies', yaxis_title='Number of Unique Industries')

fig.update_layout(coloraxis_showscale=False)

# Show the plot
fig.show()

3. The Rest of the Features Combination

In [ ]:
enable_plotly_in_cell()

# Create the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    dcc.Dropdown(
        id='x-axis-dropdown',
        options=[{'label': col, 'value': col} for col in scaled.columns],
        value='mean_annual_salary',  # Default X-axis column
    ),
    dcc.Dropdown(
        id='y-axis-dropdown',
        options=[{'label': col, 'value': col} for col in scaled.columns],
        value='number_unique_common_skills',  # Default Y-axis column
    ),
    dcc.Graph(id='scatter-plot'),
])

# Define the callback to update the scatter plot
@app.callback(
    dash.dependencies.Output('scatter-plot', 'figure'),
    [dash.dependencies.Input('x-axis-dropdown', 'value'),
     dash.dependencies.Input('y-axis-dropdown', 'value')]
)
def update_scatter_plot(x_axis_column, y_axis_column):
    if not x_axis_column or not y_axis_column:
        # If either of the dropdowns is not selected, return an empty figure
        return px.scatter()

    fig = px.scatter(scaled, x=x_axis_column, y=y_axis_column, color='Cluster', hover_name='occupation')

    # Set the updated title and labels for the plot
    fig.update_layout(
        title=f'{x_axis_column} vs {y_axis_column} Colored by Cluster',
        xaxis_title=x_axis_column,
        yaxis_title=y_axis_column,
        hovermode='closest',  # To display the tooltip for the closest point
    )
    
    fig.update_layout(coloraxis_showscale=False)

    return fig

if __name__ == '__main__':
    app.run_server(debug=True)

Association Rule Analysis¶

In this section, our aim is to identify the most common combinations of skills requested in job postings for analytics-related occupations. To achieve this, we'll use Association Rule analysis. This analysis will provide us valuable insights into the skill sets that employers highly desire.

We'll be working with Job Posting Samples data from the past decade, covering the top 10 analytics-related occupations.

Before we dive into the Association Rule analysis, it is better to review some of the core concepts and terminologies of the Association Rule:

  1. Antecedent and Consequent (Precedent): Antecedent is the itemset that occurred/listed/mentioned/bought before the Consequent itemset.

  2. Support: support is a measure of how frequently an itemset (combination of items) appears in the dataset. It is the proportion of transactions that contain the itemset. High support values indicate that the itemset appears frequently and is more common in the dataset.

  3. Confidence: Confidence is a measure of how often the consequent (right-hand side) of a rule appears in transactions given that the antecedent (left-hand side) is present. It quantifies the strength of the association between the antecedent and consequent.

  4. Lift: Lift is a measure of the strength of the association between the antecedent and consequent, taking into account the expected co-occurrence if the items were independent.

  5. Leverage: Leverage is a measure of how much the rule contributes to the support above what would be expected if the antecedents and consequents were independent.

  6. Conviction: Conviction quantifies the implication of the rule in terms of how much it is influenced by the absence of the consequents.

  7. Zhang`s Metric: Zhang's metric is a measure of the degree to which the rule is interesting and different from what would be expected by chance.

1. Preparing the Dataset for Association Rule Analysis

In [ ]:
# Extract the 'Skills' column from the DataFrame
skills = job_posting_samples[['Skills']]

# Create a new column 'Transaction' and fill it with the transaction number
skills['Transaction'] = skills.apply(lambda x: f'Transaction {x.name + 1}', axis=1)

# Move the 'Transaction' column to the first position
column_order = ['Transaction'] + skills.columns[:-1].tolist()
skills = skills[column_order]
In [ ]:
# Create a new DataFrame with individual skill entries for each transaction
new_data = []
for _, row in skills.iterrows():
    skills = row['Skills'].split(', ')
    for skill in skills:
        new_data.append([row['Transaction'], skill])

new_skill_df = pd.DataFrame(new_data, columns=['Transaction', 'Skill'])
In [ ]:
# Convert the data into a transactional format using TransactionEncoder
te = TransactionEncoder()
te_array = te.fit(new_skill_df.groupby('Transaction')['Skill'].apply(list)).transform(new_skill_df.groupby('Transaction')['Skill'].apply(list))
transactional_data = pd.DataFrame(te_array, columns=te.columns_)

2. Applying Apriori Algorithm to the Transaction Data and Geenerate Association Rule

In [ ]:
# Apply Apriori algorithm to discover frequent itemsets
min_support = 0.1  
frequent_itemsets = apriori(transactional_data, min_support=min_support, use_colnames=True)
In [ ]:
# Generate association rules based on confidence
min_confidence = 0.5  # You can adjust this confidence threshold as needed
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=min_confidence)
In [ ]:
# Display the frequent itemsets and association rules
print("Frequent Itemsets:")
print(frequent_itemsets.sort_values(by='support', ascending=False))
In [ ]:
print("\nAssociation Rules:")
rules

3. Filtering the Rules That has Strong Association

In order to focus the insight extraction using Association Rules, we need to limit and prioritize the analysis on the set of rules that has strong association. Strong association here is arbitrarily defined by the analyst or model creator given the nature of the data.

To extract the combinations of skills that has strong association, we will set the threshold to the Association Rule algorithm to only display the combinations of skills with the following metrics:

  1. Minimum Support = 0.1 --> meaning that only itemsets that appear in at least 10% of the transactions will be used for rule mining.

  2. Minimum Confidence = 0.2 --> meaning that only rules with a confidence of at least 20% will be selected.

  3. Minimum Lift = 1 --> meaning that only rules with a lift greater than 1 (indicating a positive association) will be selected.

In [ ]:
# Set the thresholds for support, confidence, and lift
min_support_threshold = 0.1  # Adjust this value as needed
min_confidence_threshold = 0.2  # Adjust this value as needed
min_lift_threshold = 1

# Filter the rules based on the specified thresholds
filtered_rules = rules[
    (rules['support'] >= min_support_threshold) &
    (rules['confidence'] >= min_confidence_threshold) &
    (rules['lift']>= min_lift_threshold)
]

# Sort the rules by support and confidence in descending order
sorted_rules = filtered_rules.sort_values(by=['lift'], ascending=False)
sorted_rules.reset_index(drop=True, inplace=True)

# Display the selected and sorted rules
sorted_rules

4. Interpretation of Association Rule Result

In order to comprehend the output of the Association Rule Analysis, we will try to elaborate and interpret the result of rules with index 0 with rules of IF {R Programming Language} THEN {Python Programming Language} that has the highest Lift value that indicates strong association between the two skills.

Antecedents and Consequents:

  • The antecedents of the rule are the skills or items that appear on the left-hand side, and in this case, it is '(R (Programming Language))'.
  • The consequents of the rule are the skills or items that appear on the right-hand side, and in this case, it is '(Python (Programming Language))'.
  • The rule implies that there is an association between job postings that mention 'R (Programming Language)' and job postings that mention 'Python (Programming Language)'.

Support:

  • Support represents the frequency of occurrence of the rule in the dataset, i.e., how often both '(R (Programming Language))' and '(Python (Programming Language))' appear together in job postings.
  • In this case, the support is 0.116, indicating that the rule applies to approximately 11.6% of the job postings in the dataset.

Confidence:

  • Confidence measures the conditional probability of finding the consequents ('Python (Programming Language)') in a transaction given that the antecedents ('R (Programming Language)') are present.
  • In this case, the confidence is 0.828571, suggesting that when 'R (Programming Language)' is mentioned in a job posting, there is an 82.86% probability that 'Python (Programming Language)' will also be mentioned in the same job posting.

Lift:

  • Lift measures the strength of association between the antecedents and consequents, taking into account the expected co-occurrence if the items were independent.
  • A lift greater than 1 indicates a positive association. In this case, the lift is 4.022191, meaning that the co-occurrence of 'R (Programming Language)' and 'Python (Programming Language)' is approximately 4.02 times more likely than if they were independent of each other.

Leverage:

  • Leverage is a measure of how much the rule contributes to the support above what would be expected if the antecedents and consequents were independent.
  • In this case, the leverage is 0.087160, indicating that the rule contributes a positive impact on the support.

Conviction:

  • Conviction quantifies the implication of the rule in terms of how much it is influenced by the absence of the consequents.
  • A conviction value greater than 1 suggests a stronger implication. In this case, the conviction is 4.631667, indicating a positive implication.

Zhang's Metric:

  • Zhang's metric is a measure of the degree to which the rule is interesting and different from what would be expected by chance.
  • Higher values of Zhang's metric suggest more interesting and unexpected rules. In this case, the Zhang's metric is 0.873697.

Overall, this association rule suggests a strong association between job postings that mention 'R (Programming Language)' and job postings that mention 'Python (Programming Language)'. When 'R (Programming Language)' is mentioned, there is a relatively high likelihood of 'Python (Programming Language)' being mentioned as well, and the co-occurrence of these skills is more likely than expected by chance.

In [ ]: